Data Scientist Capstone

By Marwan Saeed Alsharabbi

COVID-19 Analysis, Visualization & Prediction

Introduction

Coronaviruses are a family of viruses named after the crown-like spikes on their surface. The novel coronavirus, SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), is a contagious respiratory virus that was first identified in December 2019 in Wuhan, China, and has resulted in an ongoing pandemic; the first case may be traced back to 17 November 2019. On 11 February 2020, the World Health Organization designated the name COVID-19 (coronavirus disease 2019) for the disease caused by the virus. As of 8 June 2020, more than 7.06 million cases had been reported across 188 countries and territories, resulting in more than 403,000 deaths, and more than 3.16 million people had recovered. The world is going through a difficult time fighting this deadly virus, and this notebook aims to explore COVID-19 through data analysis and projections.

Step 1: Select a real-world dataset

I chose the COVID-19 dataset from Our World in Data (https://ourworldindata.org/coronavirus). I will clean and analyze the data and draw some conclusions from it, focusing on the global data. The data was downloaded from https://covid.ourworldindata.org/data/owid-covid-data.csv.

Data Sources:

Confirmed cases and deaths: Data comes from the European Centre for Disease Prevention and Control (ECDC).

Testing for COVID-19: Data is collected by the Our World in Data team from official reports; you can find the source information for every country and further details in the post on COVID-19 testing. The testing dataset is updated around twice a week.

Other variables: Data is collected from a variety of sources (United Nations, World Bank, Global Burden of Disease, etc.).

License:

The information on this page is summarized from OWID's COVID-19 GitHub page. All of Our World in Data is completely open access and all work is licensed under the Creative Commons BY license. More information about the usage of content can be found on the OWID GitHub page: https://github.com/owid/covid-19-data/tree/master/public/data

Authors:

According to OWID's COVID-19 GitHub page, the data has been collected, aggregated, and documented by Diana Beltekian, Daniel Gavrilov, Joe Hasell, Bobbie Macdonald, Edouard Mathieu, Esteban Ortiz-Ospina, Hannah Ritchie, and Max Roser.

Step 2: Questions

Step 3: Problems and Modeling

1- Problem Question :

I created a linear regression model, fit it to the OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days. I used sklearn to create the linear regression model, with an 80/20 train/test split. I then trained the model and predicted deaths for the next 30 days. I also created an XGBoost model to improve on the linear regression model, fit it to the same data, and predicted the worldwide death projection for the next 30 days.

2- Problem Question:

Can I create a model that predicts the risk level of a country's Case Mortality Ratio from its life expectancy, percentage of population over 65, diabetes_prevalence, and cardiovasc_death_rate?

I decided on using Population Over Age 65 and Obesity because, worldwide, over 80% of the deaths were in the population aged 65 and over, and the CDC has stated that 94% of deaths involved some underlying health condition. I also used Life Expectancy per country to account for possible deficiencies in the health care system. Johns Hopkins University has listed several diseases, such as heart disease and diabetes, which are known to be exacerbated by obesity. The idea is that we can predict the Mortality Ratio of COVID-19 more accurately by using both population 65 and over and Obesity rather than population 65 and over alone. This may show that creating a healthier population is the best way to prevent, in future pandemics, the devastation that the world is currently facing.

Importing Libraries

Data from the file is read and stored in a DataFrame object - one of the core data structures in Pandas for storing and working with tabular data. We typically use the _df suffix in the variable names for dataframes.
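As a minimal sketch of this loading step, the snippet below reads a tiny in-memory CSV into a DataFrame. The three rows are synthetic stand-ins (only the column names follow the OWID file), and `sample_csv` is a hypothetical name; with the real data you would pass the downloaded file path or the OWID URL to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Tiny synthetic stand-in for owid-covid-data.csv (values are made up).
sample_csv = io.StringIO(
    "iso_code,continent,location,date,total_cases,total_deaths\n"
    "AAA,Asia,Country A,2020-03-01,10,0\n"
    "AAA,Asia,Country A,2020-03-02,15,1\n"
    "BBB,Europe,Country B,2020-03-01,5,\n"
)

# read_csv returns a DataFrame; the _df suffix marks it as one.
covid_df = pd.read_csv(sample_csv)

print(covid_df.shape)   # (rows, columns)
print(covid_df.dtypes)  # inferred type of each column
```

Note that the empty `total_deaths` field in the last row is read as NaN, which is what the cleaning step has to deal with.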

Step 2: Perform data preparation & Cleaning

For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the missing or faulty value:

It is not really logical to delete NaN values here; instead, I replace them with 0. The data is a historical time series, so we cannot delete rows, even those with mostly missing values, without losing part of the history.
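A minimal sketch of that choice, on synthetic data: numeric NaN values are replaced with 0 and no rows are dropped, so every date stays in the series. Restricting the fill to numeric columns is my assumption; the notebook may have filled all columns.

```python
import numpy as np
import pandas as pd

# Synthetic example rows (values are made up).
covid_df = pd.DataFrame({
    "location": ["Country A"] * 4,
    "date": pd.date_range("2020-03-01", periods=4),
    "new_cases": [np.nan, 3.0, np.nan, 7.0],
    "continent": ["Asia", None, "Asia", "Asia"],
})

# Fill only the numeric columns; text columns are left untouched.
numeric_cols = covid_df.select_dtypes(include="number").columns
clean_df = covid_df.copy()
clean_df[numeric_cols] = clean_df[numeric_cols].fillna(0)

print(clean_df)
```

Keep in mind that a filled 0 is indistinguishable from a genuinely reported 0, which can pull down means and other aggregates.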

Numerical Features

I'd rather copy from the list than from Pandas Profiling

Categorical Features

Analysis preparation

Step 3: Perform exploratory Analysis & Visualization

Load the cleaned data and explore what its histograms look like.

It appears that each column contains values of a specific data type. You can view statistical information for numerical columns (mean, standard deviation, minimum/maximum values, and the number of non-empty values) using the .describe method.
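For illustration, here is a small sketch of both checks on a synthetic frame: `.info()` lists the data type of each column and `.describe()` summarizes the numeric ones.

```python
import pandas as pd

# Synthetic rows (values are made up).
covid_df = pd.DataFrame({
    "location": ["Country A", "Country A", "Country B", "Country B"],
    "new_cases": [10.0, 20.0, 30.0, None],
    "new_deaths": [0.0, 1.0, 2.0, 1.0],
})

covid_df.info()                # column names, dtypes, non-null counts

summary = covid_df.describe()  # numeric columns only by default
print(summary)
```

The `count` row reports non-empty values, so a missing entry shows up as a count lower than the number of rows.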

Business Questions

Data Understanding

While we have looked at overall numbers for cases, tests, positive rate, etc., it would also be useful to study these numbers on a month-by-month basis. The date column might come in handy here, as Pandas provides many utilities for working with dates.

You can see that it now has the datatype datetime64. We can now extract different parts of the date into separate columns, using the DatetimeIndex class.
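The conversion and extraction could look roughly like the sketch below. The rows are synthetic, and the new column names (`year`, `month`, `day`, `weekday`) are my own choice rather than necessarily the ones used in the notebook.

```python
import pandas as pd

# Synthetic rows (values are made up).
covid_df = pd.DataFrame({
    "date": ["2020-12-30", "2020-12-31", "2021-01-01"],
    "new_cases": [100, 120, 90],
})

# Convert the text column to datetime64.
covid_df["date"] = pd.to_datetime(covid_df["date"])

# DatetimeIndex exposes the date parts as attributes.
date_index = pd.DatetimeIndex(covid_df["date"])
covid_df["year"] = date_index.year
covid_df["month"] = date_index.month
covid_df["day"] = date_index.day
covid_df["weekday"] = date_index.weekday  # Monday = 0

# Month-by-month totals.
monthly_cases = covid_df.groupby(["year", "month"])["new_cases"].sum()
print(monthly_cases)
```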

Question#1

What is the total population of each location, by continent, in our dataset?

Question#2

The top 10 locations by total population, across continents, in our dataset

The top 10 locations by total population in the continent 'Africa' in our dataset

Create a data frame showing the total population of each continent
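One way to build these population tables is sketched below on synthetic data. Because the OWID file repeats each location's population on every date row, the sketch first keeps one row per location; that deduplication step is my assumption about how to avoid double counting, not necessarily what the notebook did.

```python
import pandas as pd

# Synthetic rows: population repeats on every date row of a location.
covid_df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "location": ["Country A", "Country A", "Country B", "Country C", "Country C"],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01", "2021-01-02"],
    "population": [100, 100, 50, 80, 80],
})

# One row per location, so each population is counted once.
population_df = covid_df.drop_duplicates(subset="location")[
    ["continent", "location", "population"]
]

# Total population of each continent.
continent_population_df = (
    population_df.groupby("continent", as_index=False)["population"]
    .sum()
    .sort_values("population", ascending=False)
)

# Top 10 locations by population (only 3 exist in this toy data).
top_locations_df = population_df.nlargest(10, "population")

print(continent_population_df)
print(top_locations_df)
```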

Question#3

Show the mean and max of total_cases, total_deaths, new_cases, total_tests, and total_vaccinations for countries in Asia, Europe, and North America
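A minimal sketch of such a summary, using synthetic rows and only two of the listed columns to keep the output small; the remaining columns can be added to `value_cols` in the same way.

```python
import pandas as pd

# Synthetic rows (values are made up).
covid_df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "location": ["Country A", "Country A", "Country C", "Country C", "Country D"],
    "total_cases": [10, 30, 5, 15, 2],
    "total_deaths": [1, 3, 0, 2, 0],
})

continents = ["Asia", "Europe", "North America"]
value_cols = ["total_cases", "total_deaths"]

# Keep only the continents of interest, then aggregate per country.
selected_df = covid_df[covid_df["continent"].isin(continents)]
stats_df = selected_df.groupby(["continent", "location"])[value_cols].agg(["mean", "max"])

print(stats_df)
```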

Question#4

Let's see the speed of transmission of the coronavirus between countries on the map.

Worldwide spread

The coronavirus is continuing its spread across the world, with almost 100 million confirmed cases in 191 countries and more than two million deaths. The virus has been detected in nearly every country, as these maps show.

We can see the COVID-19 trend moving from China to Europe and then to the US.

You can click each country and see the number representing the spread of the virus.

Question#5

Let's see the number of total_cases, total_deaths, total_deaths_per_million, and tests per confirmed case (%) on the map.

COVID-19 maps

Let's see the number of confirmed cases on the map.

For African regions, the number of confirmed cases is lower than on other continents; I suspect this is because the number of tests is quite low.

You can click each country and see the number of the total confirmed cases.

We can see that the US, Brazil, and India stand out.

Let's see the number of deaths on the map.

You can click each country and see the number of the total deaths.

We can see that the US, Brazil, Mexico, and India stand out.

Let's see the number of total deaths per million on the map.

You can click each country and see the number of total deaths per million.

We can see that South America, North America, and Europe have the highest total deaths per million.

Question#5

Top 15 countries for total_cases, total_deaths, total_deaths_per_million, total_tests, people_fully_vaccinated, and total_vaccinations, shown as horizontal bar plots (plot_hbar) and treemaps

Top 15 countries
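As a sketch of how a top-15 ranking can be derived, the snippet below takes the most recent row of each location and ranks by one cumulative column. The data is synthetic, and taking the latest row per location is my assumption about how the totals were obtained.

```python
import pandas as pd

# Synthetic rows (values are made up).
covid_df = pd.DataFrame({
    "location": ["Country A", "Country A", "Country B", "Country B", "Country C"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02", "2021-01-02"]
    ),
    "total_cases": [10, 25, 40, 60, 5],
})

# Cumulative columns: the latest value per location is its current total.
latest_df = covid_df.sort_values("date").groupby("location", as_index=False).last()

# Rank and keep at most 15 locations.
top_df = latest_df.nlargest(15, "total_cases")

print(top_df[["location", "total_cases"]])
```

The resulting frame can then be passed to a horizontal bar plot, for example with `top_df.plot.barh(x="location", y="total_cases")`.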

Visualizing Treemaps

We used this data visualization technique to display hierarchical data as nested rectangles and to show multiple elements together accurately.

Line Plot function

We used this data visualization technique to plot day-by-day trends as lines and to show multiple elements together accurately.
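Before plotting, the data has to be reshaped so that each line gets its own column. A possible sketch on synthetic rows follows; `daily_by_continent` is a hypothetical helper name, and summing across countries within a continent is my assumption.

```python
import pandas as pd

# Synthetic rows (values are made up).
covid_df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Asia", "Europe"],
    "location": ["A", "B", "C", "D", "A", "C"],
    "date": pd.to_datetime(
        ["2021-01-01", "2021-01-01", "2021-01-01",
         "2021-01-01", "2021-01-02", "2021-01-02"]
    ),
    "new_deaths_smoothed": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})


def daily_by_continent(df, column):
    """Return a date x continent table with the daily sum of `column`."""
    return df.pivot_table(index="date", columns="continent",
                          values=column, aggfunc="sum")


trend_df = daily_by_continent(covid_df, "new_deaths_smoothed")
print(trend_df)
```

Calling `trend_df.plot()` on the result draws one line per continent, and the same helper can be reused for other daily columns such as new_tests_smoothed or positive_rate.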

Question#6

How do smoothed new deaths change day by day in each continent?

Question#7

How do smoothed new vaccinations change day by day in each continent?

Question#8

How do smoothed new tests change day by day in each continent?

Question#8

How does positive_rate change day by day in each continent?

Test population coverage

Question#9

Find clusters of countries based on gdp_per_capita and new_cases

Question#10

Find clusters of countries based on new_deaths_smoothed_per_million, handwashing_facilities, and extreme_poverty

Question#11

Find clusters of countries based on new_deaths_smoothed_per_million, aged_70_older, and population_density

Question#12

Find clusters of countries based on new_deaths_smoothed_per_million, life_expectancy, and hospital_beds_per_thousand
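The text does not say which clustering method was used for these questions. As one possible sketch, the snippet below applies k-means from scikit-learn to two of the columns on synthetic country-level data; the choice of k-means, the two clusters, and the standardization step are all my assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic country-level values (made up): two clearly separated groups.
country_df = pd.DataFrame({
    "location": ["A", "B", "C", "D", "E", "F"],
    "gdp_per_capita": [1000, 1200, 1100, 40000, 42000, 41000],
    "new_cases": [50, 60, 55, 900, 950, 920],
}).set_index("location")

# Standardize so that both columns contribute on the same scale.
scaled = StandardScaler().fit_transform(country_df)

# Assumption: k-means with 2 clusters; n_init and random_state fixed
# so that repeated runs give the same result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
country_df["cluster"] = kmeans.fit_predict(scaled)

print(country_df)
```

A scatter plot of the two columns colored by `cluster` is a quick way to inspect the result; the same pattern extends to three columns by adding them to the frame.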

Stringency Index and death rate correlation

Modeling

Correlation Analysis

Linear Regression-Forecast

I created a linear regression model, fit it to the OWID COVID-19 data, and predicted the worldwide death projection for the next 30 days. I used sklearn to create the linear regression model, with an 80/20 train/test split. I then trained the model and predicted deaths for the next 30 days. I also created an XGBoost model to improve on the linear regression model, fit it to the same data, and predicted the worldwide death projection for the next 30 days.
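The linear regression part of this workflow can be sketched as follows. The series is synthetic and perfectly linear, the single `day` feature is my simplification, and the chronological split (`shuffle=False`) is an assumption on my part, since the text only states the 80/20 ratio.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic worldwide series: 100 days of cumulative deaths (made up).
days = np.arange(100)
world_df = pd.DataFrame({"day": days, "total_deaths": 1000 + 50 * days})

X = world_df[["day"]]
y = world_df["total_deaths"]

# 80/20 split; shuffle=False keeps the last 20 days as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

# Project the next 30 days after the last observed day.
future_days = pd.DataFrame({"day": np.arange(days[-1] + 1, days[-1] + 31)})
forecast = model.predict(future_days)

print(f"test R^2: {test_r2:.3f}")
print(forecast[:5])
```

On real cumulative death counts the curve is not a straight line, so a plain linear fit will typically under- or over-shoot; that is the motivation for trying a second model such as XGBoost.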

k-nearest neighbors(KNN) algorithm

Can I create a model that predicts the risk level of a country's Case Mortality Ratio from its life expectancy, percentage of population over 65, diabetes_prevalence, and cardiovasc_death_rate?

I decided on using Population Over Age 65, diabetes_prevalence, and cardiovasc_death_rate because, worldwide, over 80% of the deaths were in the population aged 65 and over, and the CDC has stated that 94% of deaths involved some underlying health condition. I also used Life Expectancy per country to account for possible deficiencies in the health care system. Johns Hopkins University has listed several diseases, such as heart disease and diabetes, which are known to be exacerbated by obesity. The idea is that we can predict the Mortality Ratio of COVID-19 more accurately by using both population 65 and over and Obesity rather than population 65 and over alone. This may show that creating a healthier population is the best way to prevent, in future pandemics, the devastation that the world is currently facing.

After viewing the graphs in Linear Regression-Forecast, we have seen the accuracy that the XGBoost algorithm can achieve with this data. We will continue and see whether our ML algorithm can do better than we are expecting. We initially chose classification with the HighRisk category as the target, as that may be more accurate than regression. Alternatively, could more precise algorithms be used to build a learning model appropriate to the data?

Correlation Analysis

We will use diabetes prevalence, cardiovascular health, percentage of population above 70, and any other data we find most useful, to see whether we can get better results with these features.

Now we will split our data for the machine learning algorithm, using the High Risk category as our target and life_expectancy, icu_patients, diabetes_prevalence, and aged_65_older as features.

Despite our initial reservations, KNN was able to reach a decent accuracy of 90.14%.

Let's test which k value gets us our best accuracy.

Interestingly, we get a slightly better classification accuracy of 90.226% with k = 7.
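A search over k could be sketched as below. Everything about the data is synthetic, including the rule that creates the `HighRisk` label, so the accuracies printed here say nothing about the real dataset; the range of k values (1 to 10), the 80/20 split, and the scaling step are also my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic country-level features (made up).
rng = np.random.default_rng(0)
n = 200
country_df = pd.DataFrame({
    "life_expectancy": rng.normal(72, 7, n),
    "icu_patients": rng.normal(300, 100, n),
    "diabetes_prevalence": rng.normal(8, 3, n),
    "aged_65_older": rng.normal(9, 5, n),
})
# Hypothetical labeling rule, used only to have a target to predict.
country_df["HighRisk"] = country_df["aged_65_older"] > 14

feature_cols = ["life_expectancy", "icu_patients",
                "diabetes_prevalence", "aged_65_older"]
X = country_df[feature_cols]
y = country_df["HighRisk"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# KNN is distance based, so the features are put on a common scale.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Test accuracy for each candidate k.
accuracy_by_k = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    accuracy_by_k[k] = knn.score(X_test_s, y_test)

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k)
print("best k:", best_k)
```

Picking k on the test set, as done here for brevity, gives an optimistic estimate; cross-validation on the training data is the more careful option.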

A closer look at our predictions and the Y_test values shows that we reach 90.226% simply by predicting almost everything as False, so this model's features and data should be improved.
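This effect is easy to reproduce on a toy example. In the sketch below, the labels are synthetic (1 stands for HighRisk), and a "model" that always predicts 0 still reaches 90% accuracy because 90% of the true labels are 0; the confusion matrix and recall expose the problem.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 9 low-risk countries (0) and 1 high-risk country (1).
y_test = np.array([0] * 9 + [1])
# Predictions that label everything as low risk.
y_pred = np.zeros(10, dtype=int)

accuracy = accuracy_score(y_test, y_pred)

# Accuracy of always guessing the most common class.
positive_share = y_test.mean()
baseline = max(positive_share, 1 - positive_share)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])

# Share of truly high-risk countries that were found.
recall = recall_score(y_test, y_pred)

print(f"accuracy={accuracy:.2f}, baseline={baseline:.2f}, recall={recall:.2f}")
print(cm)
```

When accuracy is no higher than the majority-class baseline, the classifier has learned little, which is why recall or a confusion matrix is worth reporting next to accuracy for an imbalanced target.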

Now we will test all the features we currently have and use a greedy algorithm to select the best ones, using KNN with all k values between 1 and 7.
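A greedy forward selection of this kind might look like the sketch below: starting from no features, it repeatedly adds the feature (and k) that gives the best score and stops when nothing improves. The helper name `greedy_select`, the use of 5-fold cross-validation as the score, and the synthetic data are my assumptions, not the notebook's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def greedy_select(X, y, k_values, max_features):
    """Greedy forward selection; returns (features, best k, best score)."""
    selected, best_score, best_k = [], 0.0, None
    remaining = list(X.columns)
    while remaining and len(selected) < max_features:
        round_best = None  # (score, feature, k) of the best candidate
        for feature in remaining:
            for k in k_values:
                model = make_pipeline(
                    StandardScaler(), KNeighborsClassifier(n_neighbors=k)
                )
                score = cross_val_score(
                    model, X[selected + [feature]], y, cv=5
                ).mean()
                if round_best is None or score > round_best[0]:
                    round_best = (score, feature, k)
        if round_best[0] <= best_score:
            break  # no candidate improves the current feature set
        best_score, feature, best_k = round_best
        selected.append(feature)
        remaining.remove(feature)
    return selected, best_k, best_score


# Synthetic data: only aged_70_older is related to the label.
rng = np.random.default_rng(0)
n = 150
X = pd.DataFrame({
    "aged_70_older": rng.normal(5, 3, n),
    "life_expectancy": rng.normal(72, 7, n),
    "cardiovasc_death_rate": rng.normal(250, 100, n),
})
y = (X["aged_70_older"] > 5).astype(int)  # hypothetical HighRisk rule

selected, best_k, best_score = greedy_select(
    X, y, k_values=range(1, 8), max_features=3
)
print(selected, best_k, round(best_score, 3))
```

Greedy selection only explores one path through the feature combinations, so it can miss a better subset, but it is far cheaper than trying every combination.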

Looking at the data above, we got slightly better accuracy using k = 7 with the following 4 features:

'aged_70_older', 'icu_patients', 'life_expectancy', 'cardiovasc_death_rate'

The model using the extra features, especially human_development_index, the smoker data, and more recent target data, has become better at predicting a country's rate of mortality vs. population, going from 90% to 93% accuracy depending on the randomization.

This may be due to different reporting systems for what is and is not a covid death and overall accuracy of the inputs.

Our original hypotheses that Age and Obesity would be factors appear to be supported by the data. One might even try regression on the normalized mortality / population; had we had the Our World in Data dataset originally, we might have gone further and tried that, as the correlation seems to be stronger.

Inferences and Conclusion

Two questions guide this daily updated publication on the global COVID-19 pandemic:

How can we make progress against the pandemic? And, are we making progress? To answer these questions we need data. But data is not enough. This is especially true in this pandemic because even the best available data is far from perfect. Much of our work therefore focuses on explaining what the data can – and cannot – tell us about the pandemic.

Our goal is two-fold:

To provide reliable, global and open data and research on how the COVID-19 pandemic is spreading, what impact the pandemic has, how we can make progress against the pandemic, and whether the measures countries are taking are successful or not;

And to build an infrastructure that allows research colleagues – and everyone who is interested – to navigate and understand this data and research.

Before we study how to make progress we should consider the more basic question: is it possible to do so?

The answer is very clear: While some countries have failed in their response to the pandemic, others met the challenge much more successfully. Perhaps the most important thing to know about the pandemic is that it is possible to fight the pandemic.

Responding successfully means two things: limiting the direct and the indirect impact of the pandemic. Countries that have responded most successfully were able to avoid choosing between the two: they avoided the trade-off between a high mortality and a high socio-economic impact of the pandemic. New Zealand has been able to bring infections down and open up their country internally. Other island nations were also able to almost entirely prevent an outbreak (like Taiwan, Australia, and Iceland). But not only islands were able to bend the curve of infections and prevent large outbreaks – Norway, Uruguay, Switzerland, South Korea, and Germany are examples. These countries suffered a smaller direct impact, but they also limited the indirect impacts because they were able to release lockdown measures earlier.

Together with colleagues at the Robert Koch Institute, the Chan School of Public Health, the UK Public Health Rapid Support Team, the London School of Hygiene and Tropical Medicine and other institutions we study countries that responded most successfully in detail.

Among the countries with the highest death toll are some of the most populous countries in the world such as the US, Brazil, and Mexico.

We can see three different ways in which the pandemic has affected countries:

While some commentaries on the pandemic have the premise that all countries failed to respond well to the pandemic the exact opposite stands out to us: Even at this early stage of the pandemic we see very large differences between countries – as the chart shows. While some suffer terrible outbreaks others have managed to contain rapid outbreaks or even prevented bad outbreaks entirely. It is possible to respond successfully to the pandemic.

Fighting the pandemic: What can every one of us do to flatten the curve?

Some measures against the pandemic are beyond what any individual can do. The development of a vaccine, R&D in pharmaceutical research, building the infrastructure to allow large-scale testing, and coordinated policy responses require large-scale collaboration and are society-wide efforts. We will explore these later.

But, as with all big problems, there are many ways to make progress and some of the most important measures are up to all of us.

In the fight against the pandemic we are in the fortunate situation that what is good for ourselves is also good for everyone else. By protecting yourself you are slowing the spread of the pandemic.

You and everyone else have the same two clear personal goals during this pandemic: Don’t get infected and don’t infect others.

To not get infected you have to do what you can to prevent the virus from entering your body through your mouth, nose, or eyes. To not infect others your goal is to prevent the virus from traveling from your body to the mouth, nose or eyes of somebody else.

What can you do? How can all of us – you and me – do our part to flatten the curve? The three main measures are called the three Ws: Wash your hands, wear a mask, watch your distance.

References and Future Work

1- https://www.geeksforgeeks.org/python-programming-language/?ref=leftbar

2- https://www.python-course.eu/python3_class_and_instance_attributes.php

3- https://thispointer.com/data-analysis-in-python-using-pandas/

4- https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas

5- https://ourworldindata.org/coronavirus

6- https://covid19.moh.gov.sa/

7- https://github.com/

8- https://www.kaggle.com/